An assessment of the genomic structural variation landscape in Sub-Saharan African populations

Structural variants are responsible for a large part of genomic variation between individuals and play a role in both common and rare diseases. Databases cataloguing structural variants notably do not represent the full spectrum of global diversity, particularly missing information from most African populations. To address this representation gap, we analysed 1,091 high-coverage African genomes, 545 of which are public data sets, and 546 which have been analysed for structural variants for the first time. Variants were called using five different tools and datasets merged and jointly called using SURVIVOR. We identified 67,795 structural variants throughout the genome, with 10,421 genes having at least one variant. Using a conservative overlap in merged data, 6,414 of the structural variants (9.5%) are novel compared to the Database of Genomic Variants. This study contributes to knowledge of the landscape of structural variant diversity in Africa and presents a reliable dataset for potential applications in population genetics and health-related research.


Introduction
Structural variants (SVs) are rearrangements of genomic content that are typically greater than 50bp in size, including deletions, duplications, insertions, inversions, translocations, and more complex rearrangements.Copy number variants (CNVs) are a subset of SVs that result in a change in the number of copies of a genomic region.SVs are responsible for a large part of genomic variation between individuals [1,2,3] but also play a role in disease, having been implicated in both common diseases such as cancer [4], and rare diseases, including many neurodevelopmental disorders (NDDs) [5,6].Unfortunately, the full characterisation of SVs has been slowed by the complexity and technical challenges of SV discovery.These challenges have resulted in a delay in publicly available high-quality SV databases from large population studies, which in turn has limited our understanding of the precise role of SV in human variation and disease [1].
Understanding the variation of the population informs research on genetic disorders.Early studies to map human SVs and understand their role in human diseases were performed using microarray genotyping technology [7,8,9].These studies began to uncover the role of SVs in disease [10], but did not have the break-point accuracy and high resolution required for the discovery of small SVs [11].With the increasing number of human genomes sequenced using nextgeneration sequencing (NGS) approaches, methods to detect SVs from these data were developed.Studies such as the 1000 Genomes Project [11] and the Genome Aggregation Database (gnomAD) [2] have produced landmark SV datasets that greatly contributed to understanding the global distribution of SVs.
Current SV reference databases are biased towards European populations and contain few African genomes, especially from continental Africans, thus representing only a subset of global genetic diversity [12,13].The most comprehensive African study to date is the study of 232 individuals using medium-coverage (10x) data [14].The lack of knowledge of African genomes hampers the goal of access to genomic medicine initiatives for all global populations.As such, high-quality data sets of baseline variation in African populations will enrich our knowledge of the human genome [13] and help to interpret potentially pathogenic variants in the study of genetic diseases [15,16].Through our comprehensive dataset, we build on prior work to explore the breadth of SV diversity in Africa and establish a reliable reference of African SVs for potential applications in population genetics and health-related research.

Overview of calls of length ≤10,000bp
We first reviewed the calls that are ≤10,000bp in length and are so validated by three of the callers.There are 67,795 SVs ≤10,000bp in length, 36,981 in intergenic regions, and 30,814 in gene regions (including the untranslated regions).A total of 10,421 genes contained SVs.Categorised by type of structural variant, the number of variants are shown below in Table 1.However, note that due to the differences in the way that different callers and SURVIVOR call and handle variants that this is only a rough guide, determined as much by the power of the tools as the actual range of variants.

Novel SVs discovered
Comparison with the Database of Genomic Variants (DGV) [22] was performed using a cut-off of a reciprocal overlap of 70/90/95% and showed 6,414/9,483/11,673 new SVs respectively.Using a conservative 90% reciprocal overlap, 14% (9,483) of the dataset were novel African SVs, not previously described in DGV.This large number of unique variants highlights that African individuals harbour a substantial amount of previously undescribed structural variation.The population structure analysis described below further shows that not only do African individuals have unique variation overall, but many of these variants differ between African regions in sub-Saharan Africa.

Length spectrum
Showing the spectrum of lengths is complex.Multiple samples may have SVs that are slightly different or that the SV caller reports as slightly different.Furthermore, different tools estimate SVs of different lengths.Figure 2 and Table 3 show the distribution of lengths of SV calls and as expected we see the majority of SVs are small with 70% being less than 1,000bp.

Annotation and impact of the calls ≤ 10,000bp in length
Annotation by AnnotSV [23] shows that the majority of the SVs are likely benign, as they do not include genes previously implicated in disease, but that there are a number that may be deleterious.Table 4 summarises predictions made regarding American College of Medical Genetics (ACMG) classification.Table 5 shows the 15 variants predicted to be pathogenic.Of the 15 variants, nine of them are present in population SV databases and almost all these were found uniquely in African individuals in previous studies.With our large African cohort we can add additional frequency data to these variants.The deletion involving PTCH2 present in one individual in this cohort, is an interesting finding.It is accepted that PTCH2 a homolog of PTCH1 may also cause nevoid basal cell carcinoma syndrome, and one such variant is reported in ClinVar as a pathogenic, one-star classified variant (VCV000059967.1).However this association is under dispute as reports of healthy indiviudals with these variants have been published [24,25].This deletion involving PTCH2 may be another such case.The remaining five variants have not been previously reported in population databases and are all singletons except the deletion involving most of GNPTG and the beginning of UNKL.This variant was found in three individuals in the cohort and partially overlaps a variant reported in ClinVar as pathogenic, but is not present in any population databases.Most SNVs in GNPTG are associated with mucolipidosis III an autosomal recessive condition.With a deletion of most of the GNPTG gene these individuals are likely heterozygous carriers for a novel pathogenic African variant.Table 6 summarises how structural variants overlap with genomic regions.Compared to population SNVs an unexpectedly high proportion of variants falls within the range of a gene.Almost all of these are however intronic variants.In their paper Collins et al. [2] showed that selection against variants is highly correlated to size of the SV and the functional impact.They calculated an adjusted proportion of singleton score to compare selection against different SV types, taking into account genic location.Intronic variants like other non-coding variants score close to null, second to intergenic variants, indicating little selection against them.

Population Structure
As expected, SVs differ across the continent.A principal component analysis (PCA) using PLINK2 [26] shows that populations can be differentiated with significant variation from south to centre and east to west -Figure 4 shows the PCA.Although the structure is not as defined (first two eigenvalues are 4.3 and 3.7) as for the SNVs shown in Figure A1 (first two eigenvalues 57 and 13), the structure is clearly visible.

Calls that are >10,000bp in length
This class of calls is reported separately because a different method was used for combining results, and the consequences of false positives with very long SV regions in annotation are more serious.Preliminary investigation showed that there were many SV regions much greater than 10,000bp, some with very strong evidence.While some of these calls are indicative of some structural rearrangement it is highly unlikely that they are as long as the callers indicate.Therefore, a different approach was taken.First, we required that one of the depth-based approaches and another method validate the call.Second, we only analysed calls that are less than 200kbp long, since incorrect longer calls are likely to overlap many genes and pollute downstream analysis.We have however, made BED files available for all calls.
There were 291 SV regions in total.More detail can be found in Figure 5.A disproportionate number of these longer SV regions occur on chromosome X (158).Although with a few exceptions, the SV regions do not occur in the pseudoautosomal regions, this may be due to failure of the methods to properly account for chromosome X.Of the SV regions that do not appear on chromosome X, 41 completely covered a gene.The bed file of all the longer SV regions is available as supplementary data.Applying an ensemble call approach, we identified 67,795 SVs throughout the genome, with 10,421 genes having at least one SV.Categorising SVs by subtype is complex and biased by the capabilities of SV-detection algorithms, but with that caveat, 75% of the variants were deletions, 19% duplications, 4% insertions, 2% inversions, and 12 translocations.
There was considerable variation in SV size, with an expected majority of variants being small.About 41% are between 50-200bp, 25% between 200-800bp, 20% between 800-3,200bp, and 13% between 3,200-12,800bp.No SVs longer than that met our criteria for SV-detection although individual methods were able to detect longer SVs.These proportions support the premise that size is one of the greatest factors affecting selection against SVs [2].Again, the numbers we report are biased by the ability of different algorithms to detect different length SVs, but they do align with expected proportions.
When looking at the frequency of variants detected we see a higher number of common variants than expected.17% occur at greater than 10% frequency.More variants occured at higher frequencies compared to the gnomAD SV study where only 10% of variants had frequencies > 1%.
Given that African ancestry individuals have been known to have a greater number of SVs, this greater number of common variants may be a feature of the African SV landscape.
On average, the individuals had ≈ 3000 SVs; very few individuals varied more than ≈10% from this number.This number is lower than that reported in the gnomAD-SV study for African ancestry genomes.This could be due to the smaller size range included in the analysis.In this study, no variants larger than 12,800bp were included, where in the gnomAD-SV study they resolved variants from 50bp-10Mb.They also utilised different tools which will affect variants detected.
At first glance a surprising proportion of SVs were near genes -42% occurred in introns, 4% affected coding regions with 53% being intergenic.As mentioned earlier, when considering the location of SVs their functional impact and size have to be taken into account.If SVs are small enough to only involve non-coding introns they are unlikely to be highly selected against and therefoore will be found at relatively high proportions.Collins et al. [2] utilised normalised singleton proportions to estimate these selection pressures and indicate that intronic variants are almost as unlikely to be selected against as intergenic regions.
With respect to predicted functionality, 68% of SVs have uncertain impact according to ACMG classification, 25% are predicted as benign, 0.2% likely pathogenic and only 15 are predicted as pathogenic.Given that the participants were all adults recruited as part of population crosssectional or infectious disease studies, the relatively small number of pathogenic predictions is expected.Some of the pathogenic and likely pathogenic variants, identified here occur in a heterozygous form, and are therefore not disease causing.However, recording the frequency of these variants more accuratley remain important, as it gives a clearer indication of the mutation spectrum for the respective diseases that these variants in homozygous form contribute towards.
Using a conservative overlap, 6,414 of the 67,795 variants (9.5%) are new SVs compared to the DGV.One of the aims of our study was to add to the catalogue of known SVs in current references databases, as such databases have an underrepresentation of African genomic data.This reduces their usefulness to researchers studying individuals of African ancestry.High quality datasets of baseline population African variation, and analysis thereof, are of critical importance as they aid in the interpretation of potentially pathogenic variants in the study of genetic diseases in Africa.
In this study we encountered the multiple reasons why SV research has lagged behind SNV research.Technical and analysis challenges complicate accurate calling of SVs using short read sequencing technologies.Short read data (e.g., 150bp even if paired end) is not ideal for studying SVs.There are several signals that can be used -e.g., read depth, different parts of reads aligning to different places on a genome, unexpected difference where two paired-end reads align to the reference -but variants of increased length are more difficult to detect accurately.While moves to pan-genomes and reference graph representations may partially assist, there is no doubt that high quality long-read sequencing data is much better suited for SV detection, especially in understudied populations [27].nDifferent SV-detection algorithms have different advantages as different approaches are employed.Given what is known of the relatively high error-rate (false positives and -negatives), ensemble calling has become common.How to combine callers is not straight-forward as different tools work better for different SV lengths.For example, algorithms that use read-depth as a criterion for CNVs are suitable for detecting variants several thousand bp in length, while other tools that use read content work well for SVs that are 50-70bp in length.Thus simply combining two such tools by requiring that both tools must detect a variant is bound to miss the variant while allowing either tool to detect the variant improves sensitivity but does not impact on false positive rate.We took a conservative strategy -as discussed in Section 5.3.3.Even SV calls with strong evidence using two different tools may lead to incorrect conclusions.For variants less than 1,000bp we requiried that three of four tools must detect an SV.This strict approach meant that in some cases variants called by fewer than three tools in an individual did not proceed to the final merging step, resulting in lower variant frequencies for known common variants.
Each SV-caller is very computationally intensive.Some tools only accept BAM files with individual samples in the range 70-90GB in size.Tools often produce intermediate data files that require significant extra disk space.All algorithms require processing the BAM files once and some twice, and the detection algorithms are complex, so this project has used several hundred thousand CPU hours and hundreds of TB of data.Exploring different approaches and parameters with a significant sample size is therefore a computational challenge.These challenges have significantly hindered the discovery of the African SV landscape and therefore we are making this SV dataset available to the global research community as an important resource to enhance the discovery of functional impact of SVs in the genome.

Conclusion
This paper introducing a database of SVs in African populations, contributes to knowledge of the landscape of human diversity with a focus on Africa.We have identified a large number of SVs previously undiscovered in other populations that, together with existing data sets, will assist future work into understanding the functional impact of genomic variation and will become a valuable resource for researchers and clinicians studying individuals with African ancestry.

Data sets
We used high coverage (> 30×), short read, WGS data from 1,091 individuals obtained from various sources, including the Human Heredity and Health in Africa (H3Africa) consortium [28], the 1000 Genomes Project [11], Simons Genome Diversity Project [29], the South African Human Genome Programme [30] and other unpublished African resources.Table 7 gives an overview of the populations represented in this study, and is further described in Appendix A. Figure A1 shows the regional distribution of the included datasets.

Data alignment to reference
The 1000 Genome Project VCFs were created using the protocol described in [31].We applied the same protocol to all other samples.Briefly, reads were aligned with build GRCh38 using BWA (v.0.17-r1188) [32].Finally, we performed alignment duplicate marking with GATK MarkDuplicates (v4.1.3)and quality score recalibration with GATK Quality Score Recalibrator (v4.1.3)to improve the quality of the alignment BAM file.

Variant calling
We used several SV and CNV calling tools and produced a consensus dataset, with the Nextflow scripts used accessible here: (https://github.com/shaze/h3acnvalls).We first describe the methodological approach for consensus calling and then detail the calling and merging steps.

Methodological approach
Producing a dataset from many samples using multiple tools requires merging both across samples and tools.Ideally, calling across samples would be done jointly to improve quality of calls and produce cleaner datasets.However, not all calling tools are capable of joint calling, especially on larger datasets.There are three complications in producing a merged dataset.First, SVs and CNVs vary more than single nucleotide variants (SNVs) -both in the type of variant and in the amount of the genome affected by the variant.For example, two samples may have insertions in overlapping regions of the genome that may not be identical.Whether these should be recorded as one or as two SV regions depends in principle on how similar the insertions are (do they differ in content very slightly and so likely are products of the same evolutionary event, or very differently and to what degree do the called regions overlap).Second, the difficulties in calling and limitations of calling tools and random variation in the sequencing of different samples (especially with short reads) means that even if two samples have exactly the same variant, a caller may report them slightly differently.Third, SVs and CNVs are complex, and the various tools used for merging have limits and are not always regularly updated.A simple case is that one caller may call a duplication (DUP) and another an insertion (INS), CNV, or breakpoint (BND) with minor differences in the actual sequencing.More complicated variant annotations (BND, translocation [TRA], inter-chromosomal translocation [CTX], intra-chromosomal translocation [ITX]) are challenging because a caller may produce an annotation that the other does not recognise, and the rules for how combinations are performed may be hidden in code.These considerations impact the decisions made in the protocols described below.Furthermore, even tools capable of "joint calling" produce outputs that require postprocessing.Joint calling is generally done by first running the tool independently on each sample, merging all the sites where SVs were detected across all samples, and then regenotyping the individual BAMs on the merged site list.For example, a tool detects a duplication in sample A on chromosome 1 at location 10,000 to 10,300, and also detects a duplication in sample B on chromosome 1 at location 10,010 to 10,298.In reality, these two SVs are likely the same SV, but the tools treat them as independent SVs.In the re-genotyping step both sites are checked for all samples -there is a high-probability that both SVs will be called on both samples.On sample A, the first deletion may have a better quality score than the second deletion, but the second deletion will also have a very good score.To remedy this, we use 'stitching' as described below.

Calling
Many tools have been developed to call SVs from short-read WGS data.As each tool has its own limitations and advantages, the use of a variety of SV/CNV-calling algorithms can generate a more comprehensive dataset.As such, SV/CNV calling was performed using several different tools.We then combined the results from the different tools and performed joint calling using SURVIVOR (Figure 6).
GRIDSS2 extracts reads that are potentially indicative of SVs from BAM files (split reads, discordantly aligned reads, soft-clipped reads and reads anchored only on side of the read).A positional de Bruijn graph is constructed and analysed to identify the SVs.We followed the recommended approach for calling germline SVs except that (a) we did not do joint calling (since this is done with SURVIVOR and the GRIDSS developers suggest that joint calling of hundreds of samples is computationally infeasible) or exclude any regions; and (b) we did not exclude the ENCODE Data Analysis Center blacklist of problematic regions in the genome [33] since we did this jointly across all data calls.We used the GRIDSS2-recommended post-analysis of break-end (BND) variants to determine simple variant types like INS, DEL and DUP and used Truvari [34] to merge variants that were close to each other.
Delly performs SV calling using a combination of split-read and read-pair information.Briefly, for both Delly algorithms individually, the process is: (a) each sample is separately called; (b) all sites found across all individuals are merged; (b) each sample is called again using the common sites to improve accuracy; (c) the resultant calls are merged and (d) filtered using Delly's SV filter, and then (e) split into individual files for later processing.Delly has a separate read-depth CNV calling method --the results of the SV caller are used to refine the start and end points of the variant.We used the recommended protocol for detecting germline SVs with Delly: SVs are called using the BAM files of individual samples, and the call sites between samples are merged.The individual BAM files are regenotyped using the combined merged list.For CNV calling, each BAM is called independently using the SV calls from that sample as auxiliary information, the sites are merged and then each BAM is re-genotyped on the combined site list.
Manta [17] uses break-end information as its primary SV detection mechanism.SVs are detected per sample; there is no type of joint calling.The first phase of the algorithm builds a break-end graph using the aligned reads.Once the break-end graph is created, reads aligning to each break are analysed and assembled, and evidence for different types of SVs is evaluated.
Smoove [20] is a wrapper and workflow for LUMPY [35].LUMPY uses read-pair, split-read, and read-depth information to make probabilistic calls of SVs.Joint calling is supported.We followed the recommended workflow of (a) calling SVs on each BAM separately; (b) merging all called sites; (c) re-genotyping each BAM using the merged site list; and (d) combining ("smoove paste").We then annotated and filtered out poor quality calls.
Counts of variants for individual tools For tools that support joint calling, this can be computed directly from the joint output.For GRIDSS and Manta, for which joint calling is not supported, after removing poor quality calls the method proposed by Truvari [34] was followed: the VCF files were merged using bcftools 1.15 [36].The resulting file was split into deletions and non-deletions and then Truvari was used on each file to merge SVs from different samples that overlap.The resulting files were then merged and sorted.The parameters can be found on our Nextflow scripts.
CNVPytor is primarily a depth-based CNV caller.We computed read-depth and called CNVs independently on each BAM and using a window size of 1,000 bp.Although, CNVPytor's merge in principle supports some sort of joint calling we were unable to produce useful output.The results of CNVPytor's calling is in windows of size 1,000 bp resulting in imprecise variant boundaries, which impacted how we combined results.

Merging results
When combining the subsequent variant lists from each tool, we only used variants that passed quality control according to each tool specification.Although there is an argument for including variants with weaker evidence when doing ensemble calling, we chose to be strict.After preliminary experimentation, we used the hybrid approach for ensemble calling, depending on variant length, shown in Figure 6.After preliminary analysis, we filtered out insertions greater than 1,000bp, deletions greater than 10,000bp, inversions greater than 20,000bp and duplications greater than 20,000bp.
For variants that are ≤ 10,000bp, we use SURVIVOR [37] to select variants that were supported by any three of the following tools: Manta, Delly/SV, GRIDSS and smoove.There are relatively few variants larger than this cutoff point.However, we excluded longer variants: manual inspection of examples showed very high-quality evidence for some sort of SV, but the lengths were implausible and in some cases caused thousands of genes to be implicated in an SV.The risk of false positives or misinterpretation outweighs the value of such calls.Different support values s (2, 3, 4) were tested, which means how many tools called the same variant.For each sample, SURVIVOR was used to combine the results of the four different tools so that at least s of the tools supported the call for a variant.SURVIVOR then combines the per-sample results to produce an overall jointly called file.Finally, the ENCODE DAC regions are excluded.For the main analyses, a variant must be called by at least three tool (s = 3) was used.Preliminary analysis showed that different tools annotate variants differently (e.g., INS, DUP, BND).As such, we used SURVIVOR's mode of ignoring the variant type to decide to call a consensus.
For longer variants (>10,000bp), we required that they be called by CNVPytor and at least one of the other callers.This means that such calls are supported by both read-depth and sequence content.As CNVPytor has very imprecise calls, SURVIVOR was inappropriate, and we used our own script validate call.Therefore, we took all variants longer than 10,000 bp found by Manta, Delly/SV, Delly/CNV, and smoove and selected those that had at least a 90% overlap with a call from CNVPytor.GRIDSS calls were all shorter than 10,000bp.

Variant annotation
Annotation of the SVs ≤ 10,000 bp was performed using AnnotSV [23].AnnotSV reports whether SV regionss are intergenic, coding, or in another genic or non-genic region.It also predicts variant classification according to the ACMG guidelines [38].We merged SV regions that overlapped by at least 70%, covered the same genes, and had the predicted SV type (either identical or both [DUP, CNV] or [DEL, CNV]).For larger SV regions, AnnotSV was unable to successfully complete the annotation.
African Wits-INDEPTH Partnership for the Genomic Reseearch (AWI-Gen) The AWI-Gen study is a longitudinal study of over 10,000 participants investigating the interplay between genetic and environmental risk factors in cardiometabolic diseases [39].The samples from Burkina Faso described above come from this project.The EGAD00001007589 data set comprises WGS from 100 of these randomly selected healthy individuals who were all South-Eastern Bantu speakers living in South Africa, and were sequenced by a different sequencing centre.Samples were prepared using a PCR-free method and sequenced on an Illumina HiSeq X Ten instrument with a minimum read depth of coverage of 30x.
Southern African Human Genome Programme EGAD00001003791 contains 15 samples from Bantu-speaking individuals.Samples were prepared using PCR-free methods for sequencing on the Illumina HiSeq 2000 instrument with a minimum read depth of coverage of 30x [30].
Simons Genome Diversity Project Twenty-five samples were selected from the 300 SSA individuals in the Simons Genome Diversity Project.These individuals were from Botswana, Congo, Kenya, Namibia, Nigeria, Senegal, South Africa, Sudan.Samples were sequenced at an average depth of 43x at Illumina Ltd.; almost all samples were prepared using the same PCR-free library preparation [29].
1000 Genomes Project Consortium A total of 500 African samples were included from the expanded 1000 Genomes Project where 3 202 samples were sequenced with deep coverage.Participants were Gambian Mandinka, Mende from Sierra Leone, Yoruba from Ibadan, Nigeria, Esan from Nigeria and Luhya from Webuye, Kenya.Samples were prepared using Truseq DNA PCRfree (450bp) Library Preparation Kit and sequenced on an Illumina Novaseq 6000 sequencer to an average depth of 30x [40]).

Fig. 3 :
Fig. 3: Overview of number of variants per individual.

Fig. 6 :
Fig.6: Overview of merging process: Variants that are less than 10,000bp in length must be supported by at least 3 tools.Variants greater than 10,000 bp must be supported by CNVPytor, a depth-based method, and one other tool.
Global Population PCA -the African samples were used in our study.The others show context Focus on our samples, with geographical regions.Some populations and outliers were removed for clarity.

Table 1 :
Number of variants by SV type Fig.1: Overview of the calls found by each SV caller, and combining them using SURVIVOR.See the methods section for details of choices.Note that in this graph we abuse the standard UpsetPlot representation -for example, 49,000 variants are detected by both Delly and Manta which may or may not be detected by other algorithms (in the standard UpsetPlot the implication would be that they are not detected by other approaches).Note also the anomalous results for gridss and smoove is caused by the fact that joint calling is not done by these tools and so estimating the actual number of SVs is very difficult.

Table 2 :
Overview of frequencyAs is expected most SVs occur at low frequency (22,551 singletons), 38% have a frequency less than 1%, however, 17% occur at greater than 10% frequency.More variants occured at higher frequencies compared to the gnomAD SV study where only 10% of variants had frequencies > 1%.

Table 2
below gives an overview of frequency.

Table 4 :
Number of SVs by ACMG predicted impact

Table 5 :
Genes with predicted pathogenic variants.In some cases there are two variants in a gene, which are listed in separate rows.In some cases, a variants overlaps two genes, listed in the same row.

Table 6 :
Genomic location of variants

Table 7 :
Sources of high-coverage WGS data for this study on 1,091 participants from multiple African populations.Depth=median depth of mapped sections of chromosomes 1-22.Data access details are given in the Data availability section.